Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a jobserver with N tokens, not N-1 #7344

Merged
merged 1 commit into from
Sep 11, 2019

Conversation

alexcrichton
Copy link
Member

I recently added jobserver support to the cc crate and ended up
running afoul of a jobserver quirk on Windows. Due to how it's
implemented, on Windows you can't actually add more than the intial
number of tokens to the jobserver (it uses an IPC semaphore). On Unix,
however, you can since you're just writing bytes into a pipe.

In cc, however, I found it convenient to control parallelism by simply
releasing a token before the parallel loop, then reacquiring the token
after the loop. That way the loop just has to acquire a token for each
job it wants to spawn and then release it when the job finishes. This is
a bit simpler than trying to juggle the "implicit token" all over the
place as well as coordinating its use. It's technically invalid because
it allows a brief moment of N+1 parallelism since we release a token
and then do a bit of work to acquire a new token, but that's hopefully
not really the end of the world.

In any case this commit updates Cargo's creation of a jobserver to create
it with N tokens instead of N-1. The same semantics are preserved
where Cargo then immediately acquires one of the tokens, but the
difference is that this "implicit token" can be released back to the
jobserver pool, unlike before.

I recently added `jobserver` support to the `cc` crate and ended up
running afoul of a `jobserver` quirk on Windows. Due to how it's
implemented, on Windows you can't actually add more than the intial
number of tokens to the jobserver (it uses an IPC semaphore). On Unix,
however, you can since you're just writing bytes into a pipe.

In `cc`, however, I found it convenient to control parallelism by simply
releasing a token before the parallel loop, then reacquiring the token
after the loop. That way the loop just has to acquire a token for each
job it wants to spawn and then release it when the job finishes. This is
a bit simpler than trying to juggle the "implicit token" all over the
place as well as coordinating its use. It's technically invalid because
it allows a brief moment of `N+1` parallelism since we release a token
and then do a bit of work to acquire a new token, but that's hopefully
not really the end of the world.

In any case this commit updates Cargo's creation of a jobserver to create
it with `N` tokens instead of `N-1`. The same semantics are preserved
where Cargo then immediately acquires one of the tokens, but the
difference is that this "implicit token" can be released back to the
jobserver pool, unlike before.
@rust-highfive
Copy link

r? @ehuss

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Sep 9, 2019
@ehuss
Copy link
Contributor

ehuss commented Sep 11, 2019

I'm trying to better understand the motivation for this change. I think I see how it allows cc to release its implicit token, but it looks like the cc crate already handles the case where it fails to release it. If it is already being handled, why change cargo?

I think this seems ok, I can't think of any problems, but this is always so tricky to get right.

Also, I assume from an error handling standpoint in the cc crate, that if it fails while running that it doesn't really matter that it doesn't re-acquire the token since cargo will be shutting down anyways?

I'm a little concerned that there's a window after all cc threads have finished that cargo fires off another job, and cc blocks at the end trying to reacquire its implicit token. It could block for a long time just so it can cleanly exit. I guess that doesn't matter since its not holding any job slots. Is that a correct understanding? And that this PR ensure that implicit release hack works and it doesn't need to reacquire at the end?

Another question: Should I update cc on rust-lang/rust? I considered using parallel in the past, but building rayon negated any benefits. It looks like it is still using an old version with rayon.

@alexcrichton
Copy link
Member Author

If it is already being handled, why change cargo?

You can see from the release history of cc that I had a few panicked releases in quick succession this past Saturday, realizing (roughly in order)

  • The build script has an implicit token so if we acquire a separate token for each piece of spawned work, we deadlock.
  • Releasing the token at the beginning is apparently broken on Windows
  • There's some typos, but then also cc just ignores errors to release the implicit token.

This is "handled" in cc but it's technically incorrect. This is only a problem on Windows where the actual cap of the jobserver is enforced (because it's a semaphore), but if run a build with -j2 and two build scripts start working then one will fail to release its token while the other will succeed. (I think at most one script will fail to release its token). This means that there's just one less token to acquire, so the build would actually run with 1 parallelism.

TBH I haven't though too too carefully about this, I was sort of hoping bugs in the wild would guide it and figured the bugs wouldn't be that bad. It's also something I assumed I could pretty easily fix in Cargo come Monday :)

this is always so tricky to get right.

you're not kidding

Is that a correct understanding? And that this PR ensure that implicit release hack works and it doesn't need to reacquire at the end?

I believe this is correct yeah. This is why the way cc is obeying the jobserver protocol is a bit wrong, processes should never literally release their own token but rather just juggle the implicit token between spawned work and themselves trying to spawn more work. This means that there can be a bunch of build script processes hogging memory (one of the biggest reasons for -j) which won't go away until other jobs complete. In the case of build scripts it's probably not that big of a deal though.

This PR basically ensures that the initial release for the build script actually works, because currently it doesn't on Windows. Most initial releases work but if you run -j8 and a build script is the only process running, then the jobserver actually has 7 tokens max and no tokens were ever read (the implicit one was passed around) so the initial release will fail. This PR, however, makes it so the 8th token was actually acquired at the beginning, so the build script can indeed successfully release it.

I was also wondering if it's worth reacquiring the token, but the cc crate does actually run ar which can be somewhat expensive so there's at least a tiny bit more than a clean exit. (that and cc is just a library, so a build script itself may actually do much more).

Should I update cc on rust-lang/rust?

Yeah I think it should be safe to update, this probably only affects Cargo's dependency graph and that's already tested on CI with the latest libgit2 and such!

@ehuss
Copy link
Contributor

ehuss commented Sep 11, 2019

Ok! 😄

@bors r+

@bors
Copy link
Contributor

bors commented Sep 11, 2019

📌 Commit cab8640 has been approved by ehuss

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Sep 11, 2019
@bors
Copy link
Contributor

bors commented Sep 11, 2019

⌛ Testing commit cab8640 with merge a0be578...

bors added a commit that referenced this pull request Sep 11, 2019
Create a jobserver with N tokens, not N-1

I recently added `jobserver` support to the `cc` crate and ended up
running afoul of a `jobserver` quirk on Windows. Due to how it's
implemented, on Windows you can't actually add more than the intial
number of tokens to the jobserver (it uses an IPC semaphore). On Unix,
however, you can since you're just writing bytes into a pipe.

In `cc`, however, I found it convenient to control parallelism by simply
releasing a token before the parallel loop, then reacquiring the token
after the loop. That way the loop just has to acquire a token for each
job it wants to spawn and then release it when the job finishes. This is
a bit simpler than trying to juggle the "implicit token" all over the
place as well as coordinating its use. It's technically invalid because
it allows a brief moment of `N+1` parallelism since we release a token
and then do a bit of work to acquire a new token, but that's hopefully
not really the end of the world.

In any case this commit updates Cargo's creation of a jobserver to create
it with `N` tokens instead of `N-1`. The same semantics are preserved
where Cargo then immediately acquires one of the tokens, but the
difference is that this "implicit token" can be released back to the
jobserver pool, unlike before.
@bors
Copy link
Contributor

bors commented Sep 11, 2019

☀️ Test successful - checks-azure
Approved by: ehuss
Pushing a0be578 to master...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants